The complexity landscape of viral genomes

Abstract Background Viruses are among the shortest yet highly abundant species that harbor minimal instructions to infect cells, adapt, multiply, and exist. However, with the current substantial availability of viral genome sequences, the scientific repertory lacks a complexity landscape that automatically enlights viral genomes’ organization, relation, and fundamental characteristics. Results This work provides a comprehensive landscape of the viral genome’s complexity (or quantity of information), identifying the most redundant and complex groups regarding their genome sequence while providing their distribution and characteristics at a large and local scale. Moreover, we identify and quantify inverted repeats abundance in viral genomes. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including subsequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive viral genomes database, we show that double-stranded DNA viruses are, on average, the most redundant viruses while single-stranded DNA viruses are the least. Contrarily, double-stranded RNA viruses show a lower redundancy relative to single-stranded RNA. Furthermore, we extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, unprecedently providing a direct complexity analysis of human herpesviruses. We also conceive a features-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without direct comparisons between sequences. This methodology combines data compression with simple measures such as GC-content percentage and sequence length, followed by machine learning classifiers. Conclusions This article presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying viral genomes’ organization while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website (https://asilab.github.io/canvas/) for comprehending the viral genome characterization using dynamic and interactive approaches.

This work provides a comprehensive landscape of the viral genome's complexity (or quantity of information), identifying the most redundant and complex groups regarding their genome sequence while providing their distribution and characteristics at a large and local scale. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including sub-sequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive viral genomes database, we show that dsDNA viruses are on average the most redundant viruses while ssDNA viruses are the lowest. Contrarily, dsRNA viruses show a lower redundancy relative to ssRNA. We extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, unprecedently providing a direct complexity analysis to human Herpesviruses. We also conceive a features-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without using direct comparisons between sequences. This methodology works by combining data compression with simple measures such as GC-content percentage and sequence length followed by machine learning classifiers.

Conclusions:
This manuscript's presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying viral genomes' organization while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website ( https://asilab.github.io/canvas/ ) for comprehending the viral genome characterization using dynamic and interactive approaches.

Introduction
Viruses are a strong drive force of life and evolution. On average, viruses are the shortest and most abundant life realm, being estimated in around 10 31 particles while occupying almost every ecosystem [1,2,3] and infecting all types of life forms, namely eukaryotes and prokaryotes [4,5].
The dependence on the host's cell forces viruses to interact with cellular pathways to successfully hijack and customise the host cell machinery for viral production has generated a longstanding effect of adaptation and counter-adaptation between host and viruses for gene expression and nucleic acid synthe-• We provide a comprehensive landscape of the viral genomes complexity. • We demonstrate that data compressors can efficiently quantify the complexity of viral genome sequences, including subsequences better represented by algorithmic sources. • We identify the viral genomes with lower and higher quantity of inversions.
• We use minimal bi-directional complexity profiles as local measures of the viral genome.
• We present an in-depth complexity analysis of the human herpesviruses.
• We show that the viral genome redundancy, GC-content, and size are efficient features to accurately distinguish between viral genomes at different taxonomic levels. • Our work opens new frontiers for studying viral genomes' complexity while depicting complexity trends in viral genomes.
sis. In addition to this co-evolution, during their replication, viruses can perform horizontal gene transfer, increasing the host species' genetic diversity analogously to the process of sexual reproduction [6].
Despite the significant impact that viruses have on the evolution of living beings and the ecosystem, their understanding is still relatively limited compared with other realms of life. In particular, the complexity landscape of viruses is unknown. What are the most redundant and complex viral DNA/RNA sequences? Which viruses contain more genetic inversions? How does the complexity distribution of viruses describe their morphology and behaviour? Additionally, analyzing the complexity of the genome sequence may uncover important information regarding viral processes and distinguish among viral characteristics. Studying the complexity (or quantity of information) of a DNA/RNA sequence requires efficient data compressors that take into account the probabilistic and algorithmic characteristics of the data. Already, several studies have shown the high capability of data compressors as approximations of complexity. In genomics, for example, it has been used to analyse the complexity of different DNA genomes [7], perform rearrangement detection [8] and sequence clustering [9], compute phylogenetic trees [10], perform protein structure prediction [11], compare biological networks [12], and utilised in metagenomic applications [13].
This manuscript presents an extensive complexity analysis of the viral world through the automatic computational analysis of its genome complexity and associated characteristics. Specifically, we use a genomic compressor to analyse the complexity across viral taxonomies and quantify the algorithmic information embedded in viral genome sequences better represented by small programs. Several questions arise when addressing this problem: How much information is present in a viral genome? What is the best way to quantify the information in a viral genome? What type of information can we retrieve from analysing the complexity of the viral genome? To answer them, we use unsupervised probabilistic and algorithmic information quantification in viral genomes. To achieve this, we built a high-quality viral genomes database using the NCBI reference database with 12,168 complete reference genomes from 9,605 viral taxa.
To perform the complexity analysis of these genomes, we made use of the state-of-the-art genome compressor GeCo3 [14]. We compressed each viral sequence using 19 different levels and calculated its normalized compression (NC) to determine the best models to perform the analysis. With this level, we compare the compression of GeCo3 with one of the best general-purpose compressors (PAQ8) and the Block Decomposition Method (BDM) on a synthetic sequence with embedded inverted repeats (IRs). The results show that, unlike other programs, GeCo3 is capable of detecting and compress-ing IRs. We use this knowledge to analyse viruses regarding their complexity and overall abundance of inverted repeats and construct phylogenetic trees. We provide several insights into patterns between the complexity and viral groups. Finally, we show that these measurements can perform viral genome authentication and classification with high accuracy without directly comparing the sequences but rather using the individual features. We demonstrate that efficient data compression is crucial for understanding the viral organization according to the high reported classification accuracy.
This article is organized as follows. In the next section, we describe this paper's background and related work, followed by a description of the methods. Then, we present the main results. Finally, we review the results obtained in the discussion, draw conclusions, and point out possible future work lines.

Background
This manuscript shows that the efficient use of specific data compressors to quantify data complexity (Kolmogorov complexity) profoundly impacts viral genomes identification, classification, and organization. For introducing several concepts, this section provides an overview of the viral nature, Kolmogorov complexity and data compression, and the role of inverted repeats (IR) in the genome sequence.

Viruses Microbiology
Viruses are submicroscopic biological infectious agents that require living cells of an organism to be active for replication [15]. Viruses can exist outside of their host in the form of independent particles named virions, that are composed of the genetic material (DNA or RNA) enclosed by the capsid, which is a protein shell that protects the viral genome while it is extracellular and promotes its entry in the host cells [16].
Other viruses, like bacteriophages, have developed other structures composed by elongated capsids attached to a cylindrical tailed sheet (Figure 1 C) [19].
Others have an outer lipid bilayer named viral envelope (Figure 1 D), which is constituted by a modified form of the host's cell membranes. Viroids have naked genomes, without any protective layer. Like viruses, they use the host's machinery for their replication, but their genomes do not encode proteins [20]. Furthermore, some viruses are dependent on the presence of another virus species in the host cell to be transmitted to new cells. They were named 'satellites' and may represent evolutionary intermediates of viroids and viruses [21,22]. Viral genomes can be of double-stranded DNA (dsDNA), single-stranded DNA (ssDNA), double-stranded RNA (dsRNA) where the capsoid has an helical shape that envelops the genomic material, virus (B) is icosahedral following cubic symmetry , (C) is virus covered by a viral envelop and (D) depicts a complex virus, namely a bacteriophage with a prolate capsid protecting the genomic material. or single-stranded RNA (ssRNA) nature, being linear or circular molecules [23]. The ssRNA viruses can be further classified as positive-or negative-ssRNA, depending on the sense of their RNA strand. These features determine the viral replication and mRNA synthesis pathways. For instance, (+)-ssRNA is directly translated into proteins by the host cell's ribosomes, acting as mRNA. On the other hand, (-)-ssRNA needs to be converted to a (+)-ssRNA by and RNA-dependent RNA polymerase (RdRp) before translation. RdRp also transcribes dsRNA to mRNA (using the negative strand as template) and it is indispensable for the replication of RNA viral genomes. Finally, ssDNA and dsDNA normally make use of the host's DNA-dependent RNA polymerase to form mRNA. However, before this process, ss-DNA is converted to a dsDNA by a DNA polymerase upon cell invasion [24], which is also the enzyme involved in the replication of DNA viruses. The RdRps have a high error rate due to their low proofreading activity and, therefore, replication of RNA viruses is much more prone to mutation than that of DNA viruses [25].
Viruses have a huge size variation, ranging from around 10 nm with small genomes to viruses with similar dimensions and genome size of Bacteria and archaea [26,27]. These viruses are called giant viruses and contain many unique genes currently not found in other life forms.
Although the origin of viruses is still uncertain, they play an important role in the evolution of living organisms since they are horizontal gene transfer vehicles, a biological phenomenon that increases genetic diversity. It allows viral genetic material to occasionally integrate into the host genomes being transferred vertically to its offspring. This property is so preponderant in evolution that the origin of the eukaryotic nucleus might be related with this process [28,29,30].
Additionally, viral genomic integration allows to infer the evolutionary distance between hosts by observing the shared virus integrated into their genomes. For instance, in humans, viruses frequently establish persisting infections and imprint their genetic material in the tissues throughout life, displaying phylogeographies patterns. These can be used as markers to better understand the human population history and migrations and provide new insights into unidentified individuals' origins in both global and local scales [31]. In this respect, the JC polyomavirus is one of the most comprehensively studied virus. Its genotype-specific global spread has been suggested to indicate the origins of modern [32] and ancient humans [33,34,35]. Furthermore, a worldwide study supported the co-dispersal of this virus with major human migratory routes and its co-divergence with human mitochondrial and nuclear markers [36].
Thus, performing computer analysis of viral and host DNA sequences is fundamental to understand the evolutionary relationships between different viruses and their hosts, identify the ancestors of modern viruses, and better understand their behavior and function. Also, the genomic sequences encode not only production of proteins, but also their high-dimensional folding structure [37,38]. Therefore, the direct study of viral genome sequences also develops the knowledge of the viral mechanism of protein formation and assembly.

Inverted Repeats
IRs are nucleotide sequences that have a downstream reverse complement copy, causing a self-complementary base pairing region [39]. Consequently, IRs normally fold into different secondary structures (hairpin-and cruciform-like structures, pseudoknots) that participate or interfere in many cellular processes in all forms of life, including DNA replication [40,25]. Due to these traits, IRs perform an essential role in genome instability [41], which contributes to mutability. In the short term, this mutability can create diseases [42], but across long periods lead to cellular evolution, and genetic diversity [43]. In many viruses, IRs in the form of pseudoknots are involved in ribosomal frameshifting, a translational mechanism that allows the production of different proteins encoded by overlapping open reading frames (ORFs) of the same mRNA [44,45]. This feature allows them to encode a larger amount of genetic information in small genomes and constitutes another level of gene regulation [46]. The genomes of some viruses, such as parvovirus, are flanked by inverted terminal repeats (ITRs) that form hairpin structures functioning as a duplex origin of replication sequence [40,47]. Therefore, these ITRs contain most of the cis-acting information needed for viral replication as well as viral packaging [47]. In adeno-associated viruses, ITRs are essential for intermolecular recombination and circularization of genomes [48]. IRs can also function as termination transcription signals, especially in giant viruses [49,50]. Solomonoff,Kolmogorov,and Chaitin [51,52,53,54] described the notion of complexity by showing that there is at least one optimal algorithm among all the algorithms that decode strings from their codes. For all strings, this algorithm allows codes as short as any other, up to an additive constant that depends only on the strings themselves. Concretely, algorithmic information is a measure that quantifies the information of a string x by determining its complexity K(x) by

Kolmogorov Complexity and Data Compression
where K(s) is defined by a shortest length l of a binary program p that computes the string x on a universal Turing machine U and halts [53]. This notion that the complexity of a string can be defined as the length of a shortest binary program that outputs that string was universally adopted and is the standard to perform information quantification. It differs from Shannon's entropy because it recognises that the source creates structures which follow algorithmic schemes [55,56], rather than regarding the machine as generating symbols from a probabilistic function.
While the Kolmogorov complexity is non-computable, it can be approximated with programs for such purpose. A possible approximation is the Coding Theorem Method (CTM) [57], and its improved version, the Block Decomposition Method (BDM) [58], which approximate local estimations of algorithmic complexity providing a closer relationship to the algorithmic nature. This approximation decomposes the quantification of complexity for segmented regions using small Turing machines [57]. For modelling the statistical nature, such as noise, it commutes into a Shannon entropy quantification. This approach has shown encouraging results for many distinct purposes [59,60,61].
The classical approximation of the Kolmogorov complexity is performed using data-compressors with probabilistic and algorithmic schemes. Data compressors are a natural solution to measure complexity, since, with the appropriate decoder, the bitstream produced by a lossless compression algorithm allows the reconstruction of the original data and, therefore, can be seen as an upper bound of the algorithmic complexity of the sequence. For a definition of safe approximation, see [62].
In genomics, sequences can be codified as messages using a four symbol alphabet (Σ = {A, C, G, T} for DNA sequences and Σ = {A, C, G, U} for RNA sequences). These messages contain instructions for survival and replication of the organism, its' morphology and historical marks from previous generations [63]. Initially, genomic sequences were compressed with general-purpose data-compressors such as gzip [64], bzip2 [65], or LZMA [66]. However, this paradigm shifted towards using a specific compression algorithm after introducing Bio-Compress [67]. Genomic compressors can outperform generalpurpose compressors since they are designed to consider specific genomic properties such as the presence of a high number of copies and substitutional mutations, and multiple rearrangements, such as inverted repeats [68,69].
Given this advantage of using specific compressors for the compression of genomic data, several algorithms have emerged to model these genomic data behaviours [70]. Specifically, algorithms have been created that model repetitions and inverted repetitions in the genome regions through simple bit encoding, dictionary approaches and context modelling [71,72,73,74,75,76,77,78,79,80,81].
Currently, the state-of-the-art genomic compressors apply statistical and algorithmic model mixtures combined with arithmetic encoding. The best compression ratio performance for various genomic sequences is provided by XM [82], Jarvis [83], and Geco3 [14]. The XM compressor [82] uses three types of experts: repeat models, a low-order context model, and a short memory context model. On the other hand, Jarvis [83] uses a competitive prediction model that estimates for each symbol the best class of models to be used. There are two classes of models: weighted context models and weighted stochastic repeat models, where both classes of models use specific sub-programs to handle inverted repeats efficiently. Finally, GeCo3 [14], currently the best performing referencefree compressor, uses neural networks to improve upon the results of specific genomic models of GeCo2 [84]. Specifically, the neural networks are used in mixing multiple contexts, and substitution-tolerant context models of GeCo2 [84]. Furthermore, GeCo3 possess embedded subprograms capable of detecting genome-specific patterns, such as inverted repeats.

Methods
This section describes the measures used in this paper. Specifically, we first define information-based measures: the Normalized Block Decomposition Method, the Normalized Compression (NC) with different subprograms, the normalized compression capacity (NCC), the difference between NCs, and the minimal bi-directional complexity profiles. Afterwards, we define the GC-Content, and the compression benchmark performed. Finally, we described the classification pipeline. Specifically, the features and classifiers used and the metrics utilized for evaluating the model's performance.

Information-based measures
This section describes two approximations of the Kolmogorov complexity, one based on the decomposition of a string into blocks and their approximation based on the output of small Turing machines (Block Decomposition Method) and another based on data compression. The data compression approach was utilized to compute the Normalized Complexity and construct the minimal bi-directional complexity profiles. Therefore, we describe the Normalized Compression (NC), the minimal bi-directional complexity profiles, and the Normalized Block Decomposition Method (NBDM), in this subsection.

Normalized Block Decomposition Method (NBDM)
A possible approximation of the Kolmogorov complexity is given by using small Turing machines (TM), which approximate the components of a broader representation. The Coding Theorem Method (CTM) uses the algorithmic probability between a string's production frequency from a random program and its algorithmic complexity. The more frequent a string is, the lower its Kolmogorov complexity, and the lower frequency strings have, the higher Kolmogorov complexity is. The Block Decomposition Method (BDM) increases the capability of a CTM, approximating local estimations of algorithmic information based on Solomonoff-Levin's algorithmic probability theory. In practice, it approximates the algorithmic information and, when it loses accuracy, it approximates the Shannon entropy. Since in this article we use BDM to perform a comparison with the Normalized Compression, we considered the normalization of the BDM (NBDM) according to [85]. In this case, the NBDM is computed as where x is a string, BDM(x) is the BDM value of the string, |A| the number of different elements in x (size of the alphabet) and |x| the length of x. Since we have a four symbol alphabet (Σ = {A, C, G, T} for DNA sequences and Σ = {A, C, G, U} for RNA sequences), |A| = 4, log 2 4 = 2. Although BDM has difficulty dealing with full information quantification due to the block representability, it has proven to be a helpful tool for measuring and identifying data content similar to simple algorithms [85].

Normalized Compression (NC)
An efficient compressor, C(x), provides an upper bound approximation for the Kolmogorov complexity (K(x)), where K(x) < C(x) ≤ |x| (|x| is the length of string x in the appropriate scale). Usually, an efficient data compressor is a program that approximates both probabilistic and algorithmic sources using affordable computational resources (time and memory). Although the algorithmic nature may be more complex to model, data compressors can have embedded sub-programs to handle this nature. The normalized version, known as the Normalized Compression (NC), is defined by where C(x) is the compressed size of x in bits. Given the normalization, the NC enables to compare the proportions of information contained in the strings independently from their sizes [7]. If the compressor is efficient, then it can approximate the quantity of probabilistic-algorithmic information in data using affordable computational resources. In our work, to determine the NC, we made use of the state-of-the-art genome compressor GeCo3 [14], with the level that yielded the best average results (benchmark provided in the results section). Besides the computation of the NC using the standard configuration of this model, we also computed the NC using GeCo3 with three subprogram configurations. These subprogram con-figurations address the use or absence of inverted repetitions, namely: • IR 0 → uses the regular context model without IR detection; • IR 1 → uses IR detection simultaneously with the regular context model; • IR 2 → uses IR detection sub-program without regular context models.
There was a need to determine the sequences with the highest normalized compression capacity (NCC) in some cases. When the compressor was only using the subprogram IR 2 , NCC was computed as NCC IR 2 (x) = 1 -NC IR 2 . Only positive values were considered to filter computations where the compressor could not compress the sequence sufficiently. Another measure used to quantify inverted repeats was the difference between NC IR 0 and NC IR 1 .

Minimal bi-directional complexity profiles
A complexity profile is a numerical sequence describing for each symbol (x i ) of a sequence x the number of bits required for its compression assuming a causal order [86]. A minimal bi-directional complexity, B(x), profile assumes the minimal representation of compressing the sequences using both directions independently, namely − → C (x i ) as from the beginning to the end of the sequence, and ← − C (x i ) as from the end to the beginning. Accordingly, these profiles are defined as The construction of these profiles follows a pipeline formed of many transformations, including reversing, segmenting, inverting, and the use of specific low-pass filters after data compression to achieve better visualization. For computing these profiles, we use the GTO toolkit [87].
The generation of these profiles is robust to localize specific features in the sequences, namely low and high complexity sequences, inverted repeat regions, duplications, among others.

Other Measures
The two other measures used to perform viral analysis and classification are the GC-Content (GC) and the length of the viral genome |x|.
GC-Content (GC) represents the proportion of guanine (G) and cytosine (C) bases out the quaternary alphabet {A, C, G, T/U}. This includes thymine (T) in DNA and uracil (U) in RNA. The GC percentage is given by the number of cytosine (C) and guanine (G) bases in a viral genome x with length |x| according to where x i is each symbol of x (assuming causal order), Ξ is a subset of the genomic alphabet containing the symbols {G, C} and N the program that counts the numbers of symbols in Ξ.
GC-content is variable between different organisms. In addition, the GC-content value correlates with the organism's life-history traits, genome size [88], and GC-biased gene conversion [89]. As such, this measure is useful to perform viral classification. Furthermore, an organism with a genome high in GC-content is rich in energy and more prone to mutation. Thus, over time, a species tends to decrease its GC-content to become more stable [90], giving us further information regarding viral characterization.

Data Description
The dataset is composed of 12,163 complete reference genomes from 9,605 viral taxa retrieved from NCBI database on 22 of January 2021 using the following url https://tinyurl.com/ncbidtbs. The download was performed in a custom manner to retrieve the taxonomic id, host and geolocation of each reference genome. The metadata header was removed from each sequence using the GTO toolkit [87], where any nucleotide outside the quaternary alphabet {A, C, G, T/U}, was replaced by a random nucleotide from the quaternary alphabet. Notice that the sequences with symbols outside the alphabet are scarce. Finally, the type of genome and the taxonomic description of each sequence were retrieved using Entrez-direct [91].
Then, the retrieved NCBI sequences were filtered to remove possibly contaminated or poorly sequenced sequences. Firstly, using the taxonomic metadata, sequences that did not hold complete taxonomic information down to the genus rank and any sequences that maintained a taxonomic description of unclassified were removed. Secondly, a filter was applied to remove outlier sequences. Specifically, after computing all sequences' length, GC-Content, and Normalized Complexities, sequences whose measure fell outside µ ± 3 × σ (approximately 0.03% of all sequences) of any measure were removed. After filtering, 6,091 of the initial 12,163 sequences were kept.

Compression Model Benchmark
We selected a total of 19 levels of models to determine the best level configuration to compress the viral sequences. These levels correspond to the default 13 levels of the GeCo3 compressor and 6 others built for this task. The list of the levels used are shown in Table S1, and the description of parameters can be found in Table S2. The 13 default levels of the compressor have increasingly higher complexity and take longer to run since they use higher context models. Therefore, since the first and lightest level performed best, the other six custom-build levels were also built with small models.
Linear Discriminant Analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that separates classes of objects. The resulting combination can be used as a linear classifier [92]. Gaussian Naive Bayes is defined as a supervised machine learning classification algorithm based on the Bayes theorem following Gaussian normal distribution [93]. K-Nearest Neighbors is another approach to data classification, taking distance functions into account and performing classification predictions based on the majority vote of its neighbors [94]. Support Vector machines are supervised learning models with associated learning algorithms that construct a hyperplane in a high-dimensional space using data and perform classification [95]. Finally, XGBoost [96] is an efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that predicts a target variable by combining the estimates of a set of simpler models. Specifically, new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. This task uses a gradient descent algorithm to minimize the loss when adding new models. XG-Boost can use this method in both regression and classification predictive modeling problems.
The accuracy and weighted F1-score were used to select and evaluate the classification performance of the measures. Accuracy is the proportion between correct classifications and the total number of cases examined, while the F1-score is computed using the precision and recall of the test. We utilized the weighted version of the F1-Score due to the presence of imbalanced classes.
For comparison of the obtained results, we assessed the outcomes obtained using a random classifier. For that purpose, for each task, we determined the probability of a random sequence being correctly classified (p hit ) as where p(c i ) is the probability of each class, determined as On the other hand, p correct (c i ) is the probability of that class being correctly classified. In the case of a random classifier, |classes| .

Results
The results reported in this manuscript can be computed using the minimal characteristics described in Supplementary Subsection entitled Software and Hardware recommendations and using the procedures described in Supplementary Subsection entitled Reproducibility. The following subsections describe the data, the compression level selection benchmark, the synthetic sequence benchmark, the viral genome analysis and phylogenetic trees, and the viral classification application.

Level selection benchmark
Viral genomes have specific characteristics, for example, short length, high average complexity, and specific structures, that require the proper optimization of the data compressor to provide higher modeling adaptability and efficiency. GeCo3 is a state-of-the-art genomic compressor that contains many types of compression levels [14]. Herein, we used this tool to compress each viral genome from the dataset using 19 different levels and computed its normalized compression (NC). We evaluated the frequency where each level yielded the lowest NC (provided the best compression for a given sequence; Figure 2 A) and determined the sum of the NC from the compression of all reference genomes for each model (Figure 2 B). Overall, level 16 was selected because it provided the lowest NC on average (selected 28.38% as the best compression level) and provided the lowest sum of the NC from compressing all reference genomes. This level is constituted by a mixture using a neural network with the following models: • Model 1 → context-order of 1, alpha parameter of 1 (without inverted repeats), and gamma parameter of 0.7; • Model 2 → context-order of 12, alpha parameter of 1/50 (with inverted repeats), and gamma parameter of 0.97.
The chosen level is constituted by two models with a small and average context model. This configuration performed better because most of the viral genomes are small and compact, where repetitions and IRs are usually separated by a small genomic space. Therefore, the depth of the models is more adapted to provide higher efficiency to the average of the viral genomes than, for example, a higher context model (higher than 13) that can perform marginally better in more extensive and repetitive sequences but that loses sensitivity in the average of the genomes.

Synthetic sequence benchmark
Viral genomes can contain IRs that are subsequences better described using simple algorithmic approaches. To benchmark the capability of different programs to quantify IRs accurately, we created a genomic sequence of 10,000 nucleotides in which the last 5,000 were inverted repeats of the first 5,000. This sequence was mutated incrementally from 0% to 10%, meaning that the number of IRs will decrease with the increase of nucleotide substitutions. For each sequence, the NC was computed with ( Figure 3): i) GeCo3, without and with IR detection program (IR 0 and IR 2 , respectively) and ii) PAQ8 data compressor (one of the best general-purpose data compressors). Additionally, the Normalized Block Decomposition Method (NBDM) was also computed, as a measure more prone for the algorithmic nature quantification. Results show that GeCo3 with the IR 2 subprogram compresses the sequences better than the other programs since its NC is lower at 0% mutational rate (Figure 3). NBDM can also not detect the IRs because it provides the same high value across all sequences with various mutation rates. It is also evident that GeCo3 with IR 2 can detect IRs even in the presence of substantial mutations (5% of mutation) and takes into account different levels of nucleotide substitutions because increases with the increase of the mutational rate (i.e. decrease of IRs). The difference between NC IR 0 and NC IR 1 , both computed with GeCO3, was also analyzed. Its profile is inverse to the IR 2 and confirms that nucleotide substitutions' accumulation decreases the number of IRs in the sequence. Normalized Block Decomposition Method (NBDM) with an increase of mutation rate of a sequence (0%-10%). The NC was computed using the state-of-theart genomic compressor (GeCo3 [14]) and a general-purpose compressor (PAQ8 [97]). The NBDM is depicted by a red line and the NC value using PAQ8 by a purple line. Furthermore, the GeCo3 compressor with (IR 2 ) and without the IR detection subprogram (IR 0 ) is shown in orange and blue lines, respectively.
Finally, the green line shows the difference between NC IR 0 -NC IR 1 .

Viral genome analysis and Phylogenetic trees
The core of the viral genomes was analyzed in terms of complexity landscape, including the trends, singularities, and patterns for both the use or absence of IRs. The NC, using GeCo3, with IR 0 , IR 1 and IR 2 subprograms was determined and the NCC IR 2 was calculated. The outcome was interpreted according to the genome type or the taxonomic group, together with the average of their genome sizes ( Figure 4 and Table S3). Notice that the NC enables to compare proportions of the absence of redundancy independently from the sizes of the genomes. This value is complementary to the normalized redundancy. Specifically, consider the redundancy (R) of a sequence x, as where |x| is the length of the sequence, A is the cardinally of the sequences' alphabet and C(x) is the compressed size of x in bits, and the normalized redundancy (NR) as NR(x) = 1-(C(x)/log 2 (A)|x|).

Complexity landscape according to genome type
According to NCBI, the virus's genomes herein analyzed are of five types: dsDNA, ssDNA, dsRNA, ssRNA and mixed-DNA. Results show that ssDNA, followed by mixed-DNA and dsRNA viruses, are the genomes with higher NC, whereas dsDNA genomes have the lowest (Figure 4 A; Table S3). In general, smaller genomes are less complex and are more likely to contain fewer repeats and, hence, less redundancy, and the ss-DNA, mixed-DNA and dsRNA genomes have smaller average sequence lengths (3282 bp, 3258 bp, and 8377 bp; Table S3). According to the NCC and the NC IR 0 -NC IR 1 difference results, dsDNA and ssDNA have most significant quantities of IRs than the other genome types. This can be due to ITRs present at the ends of some dsDNA viruses, such as Adenovirus and Ampullaviruses, and ssDNA virus as Parvoviruses, or other IRs structures important that perform ribosomal frameshifting.

Complexity landscape according to taxonomic level
In complexity analysis of viral genomic sequences, when considering the Realm taxonomic level (Figure 4 B), the lowest NC values were obtained for Adnaviria, Varidnaviria and Duplodnaviria (Table S4 and S5). These results are consistent with the genomic grouping since they are composed exclusively of dsDNA viruses and have the highest sequence lengths. Thus, generally, an inverse correlation between genome size and NC was also observed as with the genome type analysis (Figure 4 A  and B) and occurs across all taxonomic levels (Table S5). However, within these three Realms, Adnaviria has the lowest sequence length and presented a lower NC than Varidnaviria and Duplodnaviria, suggesting that the last are highly complex.
Regarding IRs, Adnaviria was the realm where the highest compression was obtained using the IR 2 subprogram (highest rate of IRs; Table S6). Consequently, its only recognized kingdom, Zilligvirae, has also one of the highest NCC values (Table S6). Adnaviria is a realm constituted of mostly A-form dsDNA viruses, and the ends of their genomes contain ITRs [98]. A-form is proposed to be an adaptation allowing DNA survival under extreme conditions since their hosts are hyperthermophiles and acidophiles microorganisms from the archaea domain [98,99]. The fact that Adnaviria presented the lowest NC might indicate that their genomes require redundancy to survive such extreme environments. The kingdom Trapavirae, belonging to the realm Monodnaviria, is also composed by dsDNA viruses that infect halophilic archaea. Together with kingdom Zilligvirae, Trapavirae presented the highest difference between IRs and standard compression (Table S7). These results also support the fact that IRs can stabilize the DNA of viruses that exist in extreme environments. It has already been demonstrated that archaeal viruses with linear genomes use diverse solutions for protection and replication of the genome ends, such as including covalently closed hairpins and terminal IRs [100].
At the family level, Botourmiaviridae presented the highest complexity, followed by Alphasatellitidae and Tolecusatellitidae families (Table S5). Botourmiaviridae is composed of ssRNA viruses that infect plants, and filamentous fungi [101]. Curiously, plants and fungi have higher redundancy despite the lower redundancy of their pathogens. Alphasatellitidae and Tolecusatellitidae are families of satellite viruses that depend on the presence of another virus (helper viruses) to replicate their genomes. These satellite viruses have minimal genomes, making sense that they possess very low redundancy. Regarding IRs, Malacoherpesviridae, Herpesviridae, and Rudiviridae contained the highest NC IR 0 -NC IR 1 difference (Table S7). Malacoherpesviridae and Herpesviridae are dsDNA viruses evolutionarily close since they belong to the order Herpesvirales [102]. Malacoherpesviridae encompasses the genera Aurivirus and Ostreavirus, which infect molluscs. Herpesviridae are also known as herpesviruses and have reptiles, birds and mammals as hosts. This family will be discussed in more detail in the following subsection. Rudiviridae is a family of viruses with linear dsDNA genomes that also infect archaea. The virus of these families are highly thermostable and can act as a template for site-selective and spatially controlled chemical modification. Furthermore, the two strands of the DNA are covalently linked at both ends of the genomes, which have long ITRs [103]. Again, these IRs could be an adaptation to stabilize the genome.

Complexity landscape of the family Herpesviridae
Here we analyzed the complexity landscape of the genera of the family Herpesviridae in more detail, and results show a significant variation between them (Figure 5 A). Mardivirus had the highest NC IR 0 -NC IR 1 difference among all viruses, and only other three genera (out of thirteen) of herpesviruses  were within the ten highest differences list (Table S7) A particular group of family Herpesviridae are the human herpesviruses (HHVs). These viruses are involved in globally prevalent infections and cancers and characterized by lifelong persistence with reactivations that can potentially manifest life-threatening conditions [104]. Globally, the HHVs present a higher redundancy relative to other viruses (Figure 5 B). These viruses are divided into: i) the alpha-subfamily members, namely herpes simplex virus type 1 and 2 (HSV-1 and HSV-2) and varicella-zoster virus (VZV), ii) the beta-subfamily of human cytomegalovirus (HCMV) and human herpesviruses 6A, 6B, and 7 (HHV-6A, HHV-6B, and HHV-7) and iii) the gammasubfamily of Epstein-Barr virus (EBV) and Kaposi's sarcomaassociated herpesvirus (KSHV). Specifically, the EBV, one of the most potent cell transformation and growth-inducing viruses known, capable of immortalizing human B lymphocytes, contains a higher redundancy than the other HHVs (Figure 5  B). The other gamma-herpesvirus, KSHV, is the genome with the highest NC IR 1 (Figure 5 B). Unlike the beta-and gammasubfamilies, the alpha-subfamily is characterized by a substantial quantity of IRs, as suggested by the NCs with IR 1 and IR 2 configurations (Figure 5 B). The VZV has the shortest genome and the highest NC within this group. These differences might be justified by the different rates of evolution within these genomes [105]. Considering the beta-subfamily members, HCMV contains a small proportion of IRs while having a substantial-high NC relative to most other HHVs being analyzed. Since the HCMV has the largest genome, this was surprising because the NC typically has an inverse correlation with the genome size and the quantity of IRs. The other betasubfamily members are the Human Herpesvirus 6A, 6B, and 7, which produced lower NCs (with IR 1 and IR 2 configurations) compared to the other HHVs, with a low quantity of IRs, an effect that their integrating function might favour. For instance, HHV-6A and 6B integrate their genomes into the telomeres of latently infected cells [106,107]. Thus, their genomes contain subsequences similar to the human telomere regions that can be formed by internal nucleotide repetitions [108]. As such, these are sequences with very low complexity and, hence, highly compressible.

Alternative visualization methods of the viral complexity landscape
Phylogenetic trees were generated depicting the redundancy (NC; Figure 6 A) and the prevalence of inverted repeats (NCC; Figure 6 B) on each taxonomic branch. In addition, we performed the same analysis to portray the relation between inverted and internal repetitions ( Figure S1). These phylogenetic trees show the broad picture of the regions with more complex and less redundant sequences, regions rich in inverted repeats, and regions with a higher prevalence of inverted repeats relative to standard repetitions in the genomes.
Another way to analyze the results is by producing 3Dscatter plots of randomly sampled values obtained from computing the features sequence length (SL), NC and GC-content (GC; Figure 7 A) or 2D-scatter plots of their projections (Figure 7 B and 7 C), both concerning a particular taxonomic level (herein Realm). Analyzing the sequence length projections (Figure 7 B), it is evident that there is a logarithmic downtrend of the NC with the increase in sequence length. Thus, although longer sequences have, on average, greater complexity (absolute quantities), they have higher redundancy, which the data compressor takes advantage of to perform a better compression. On the other hand, the NC vs the GC-content displays a normal distribution around the 0.5 GC-mark, with higher complexities associated with similar frequency of occurrence of the four bases A, C, G, T/U (Figure 7 C). This result also makes sense since, in principle, a well-distributed frequency of bases makes more complex sequences to compress. More importantly, the NC, GC and SL seem to discriminate between different taxonomic groups (Figure 7). As such, in the following section, we analyze the classification capability of these features.

Viral Classification
In this section, we performed eight different classification tasks for each viral sequence from the dataset. Specifically, the sequences were classified regarding their genome type, realm, kingdom, phylum, class, order, family, and genus.
We conducted a random 80-20 train-test split on the dataset to perform viral classification. Due to classes being imbalanced in the dataset, several actions were performed. First, we did not consider classes with less than four samples. As such, depending on the classification task, the number of samples decreased from 6,091 to the values shown in Table S8 (N. Classes column). Secondly, we performed the train-test split in a stratified way to ensure the representability of each label in the train and test sets. Finally, instead of performing k-fold cross-validation, we performed the random train-test split fifty times, and we retrieved the average of the evaluation metrics. Then, we computed the Accuracy and the Weighted F1-score to select the best performing method.
Furthermore, we performed classification using seven different features: sequence length (SL), GC-content (GC), the Normalized Compression (NC) values for the best performing model, and the NC of the same model with IR configuration to 0, 1 and 2.
These seven features were fed to all the classifiers, and the accuracy and weighted F1-score were measured to determine which classifier was best suited for this task.
Tables S8 and S9 depict the accuracy and weighted F1-score values obtained for each classifier. For all classification tasks, the best performing classifier was the XGBoost classifier.
Following this, we analyzed if all features were necessary. For that purpose, the XGBoost classifier was used with only the NC feature, the NC with SL and GC, and finally, using all features. The obtained accuracies are shown in Table 1, and the weighted F1-score results are shown in Table S10. Except for the genome classification, where the usage of the NC, GC and SQ provided slightly better results than using all features, the remaining results yielded the best result when using all features. This improvement increased when the number of classes was higher, demonstrating that the different compression subprograms (IR 0 , IR 1 , and IR 2 ) are more helpful in classifying more specific taxonomic groups.
Regarding the results, there is decrease in accuracy and F1-score when there is an increase in the number of classes. Specifically, we obtained the best performance in the realm classification of the virus (accuracy -92.41%, F1-score -0.9214) and our lowest performance in genus classification (accuracy -68.42%, F1-score -0.6525). This decrease is mainly because the average number of samples per class decreases as the number of classes increases. As such, many classes may have insufficient number of samples to be accurately classified. Figure S2 represents the number of samples (genome sequences) per viral genus. We minimized this impact by removing classes from the classification that possessed less than four samples. Furthermore, part of the classification inaccuracies can be explained by possible errors in the assembly process of the original sequence or eventual sub-sequence contamination of parts of the genomes. Moreover, other inaccuracies could be due to several genomes being reconstructed using older methods that have been improved since then [109].
As far as we know, this is the first attempt at performing this type of reference-free classification. As such, for comparison purposes, we assessed the outcomes obtained using a random classifier. Specifically, for each task, we determined the probability of a random sequence being correctly classified (p hit ). Overall there is a vast improvement relative to the random classifier, showing the importance of the features used in the classification process. These classification results seem promising, showing that this metric can be utilized for viral taxonomic classification if enough sequence samples are pro- vided.

Discussion
The usage of a specialized compressor is crucial to quantify the complexity present in a genome accurately. Specialized compressors outperform general-purpose compressors because they take into account the intrinsic nature of the data. Genomic data is highly heterogeneous and has high substitution mutations and data rearrangements, such as fusions, translocations, and inversions [68,69]. Therefore, the ability of a genomic data compressor to adapt to this heteroge-neous data, being able to perform an accurate structure modelling and detect repetitions in the presence of the high substitutional mutations and rearrangements in genomic data is fundamental to achieve high compressibility of the genome sequence. This article evaluates the capacity to identify dataspecific patterns in genomic sequences by comparing the potential of tree methods to recognize IRs. Precisely, the NBDM was estimated, and the NC was computed using a genomic compressor (GeCo3 [14]) and a general-purpose data-compressor (PAQ8 [110,111]). When GeCo3 had the subprogram activated that detects IRs (NC IR 2 ), it showed substantially higher compression than general-purpose because PAQ uses models that do not consider these specific properties of the genomic se- Table 1. Results obtained for viral taxonomic classification task regarding the genome type, realm, kingdom, phylum, class, order, family, and genus using XGBoost classifier. The features used were the genome's sequence length (SL), the GC-content (GC) and the Normalized Compression (NC) values for the best model, the same model with IR configuration to 0, to 1 and 2. The results correspond to the accuracy (ACC), and the probability of a random sequence being correctly classified (p hit ) using a random classifier (p hit (C Random ).  quences. The same occurs when comparing GeCo3 (NC IR 2 ) with NBDM, showing that despite NBDM being able to detect small subprograms in synthetic data [85], it cannot detect IRs in genomic data. Moreover, GeCo3 compression capability was resistant to substitutional mutation up to 10%, showing that it can also deal with this extreme nature of genomic data, namely approximate IRs.
On average, RNA viruses mutate faster than DNA viruses, double-strand viruses mutate slower than single-stranded viruses, and genome size correlates negatively with mutation rate [112]. In this article, we have shown that the redundancy of dsDNA is higher than ssDNA, but for RNA viruses, the opposite occurs. The sequences used in this study to measure a lower NC (higher normalized redundancy) of the ssRNA to dsRNA have approximately the same length. However, the dataset of dsRNA has less than one order of magnitude in the number of sequences. This difference is natural since the ssRNA is much more abundant than dsRNA. Nevertheless, this discrepancy could justify the higher normalized redundancy of ssRNA in the first instance. However, although the lower average NC values of ssRNA are similar to dsRNA, the dsRNA has higher NC extremes. Therefore, we argue that this difference in the number of sequences in the dsRNA is not significant in changing the lower average of the ssRNA. Also, ssRNA are more prone to mutation than dsRNA [113]. On the other hand, extensive C to U mutations have been reported in many mammalian RNA viruses [114]. This behaviour was detected during a much faster evolution of the SARS-CoV-2, an ssRNA virus [115]. Therefore, the faster average decrease of GC-content in ssRNA viruses explains a decrease in the ssRNA entropy and, hence, average NC. A higher GC-content (approximately 2%) of the dsRNA over ssRNA strengthens these outcomes (Table S3).
We performed an analysis of the human herpesvirus regard-ing their genome complexity and IRs abundance. Specifically, we analyzed the various behaviours of their subfamilies and identified that different complexities could be representative of the different rates of evolution within these genomes. Finally, we suggest that maybe a lower NC and abundance of inversions present in herpesvirus are associated with viral genome integration.
Lastly, we evaluated the capability of using complexity measures to perform viral classification at different taxonomic levels. Notably, results showed that we can automatically and accurately distinguish between viral genomes at different taxonomic levels using the XGBoost classifier with all features (NC with different configurations, GC-content and SL). However, a decrease in accuracy when approaching the lowest taxonomic levels was observed, which can be increased with future entries to the database. Finally, despite the high accuracy results obtained, further improvement of the results may be possible in the classification by adding the transcribed viral proteome information.

Conclusion
This manuscript shows that the efficient approximation of the Kolmogorov complexities of viral sequences as measures that quantify the absence of redundancy have a profound impact on genomes identification, classification, and organization.
For computing an upper bound of the sequence complexity, we benchmark a specific data compressor (GeCo3), after optimization, against other approaches, namely a high compression ratio general-purpose data compressor (PAQ) and a measure that combines small algorithmic programs and Shannon entropy (BDM). Unlike the other approaches, we show that GeCo3 can efficiently address and quantify regions properly described by simple algorithmic sources, namely exact and approximate inverted repeats, among other characteristics.
Using an optimized compression level of GeCo3 in an extensive viral dataset, we provide a comprehensive landscape of the viral genome's complexity, comparing the viral genomes at several taxonomic levels while identifying the genome regarding the lowest and highest proportion of complexity. Specifically, on average, dsDNA viruses are the most redundant (less complexity) according to their size, and ssDNA the less redundant. Contrarily, dsRNA shows a lower redundancy relative to ssRNA.
We perform an in-depth analysis of the human herpesvirus regarding their genome complexity and abundance of IRs. We induce that a lower NC and abundance of inversions present in herpesvirus may be associated with viral genome integration.
We describe and use minimal bi-directional complexity profiles of one sequence of each virus to visualize the distribution of complexity of these sequences locally. These profiles can describe actual regions detected in the genome with other meth-ods, proving the description capability of data compression at a structural level.
We reveal the importance of efficient data compression in genome classification tasks, explicitly showing that the complexity, when combined with simple measures (GC-content and size), is efficient to accurately distinguish between viral genomes at different taxonomic levels without using direct comparisons between sequences.
The methods and results presented in this work provide new frontiers for studying viral genomes' complexity while magnifying the importance of developing efficient data compression methods for automatic and accurate viral analysis.

Availability of supporting data and materials Website
You can access a support website to this paper at https://asilab.github.io/canvas/. This site showcases, among other things, the pipeline of this study, the compressor's model selection, the detection of inverted repeats in synthetic genomic sequences, the viral genome characterization with regards to genome and type of taxonomic group, and the computed phylogenetic trees with a magnifier to allow a better observation of the normalized complexity results with illustrative examples of viruses.

Competing Interests
The authors declare no competing interests.

Author's Contributions
J.M.S. and D.P. designed the experiment, executed data analysis and wrote the manuscript. All authors analysed and discussed the results and revised the manuscript.

Supplementary Material
Here, we depict the supplementary material of the article. The supplementary material is described in four main sections: Compression Model Benchmark, Viral Genome Analysis, Classification, Software and Hardware recommendations, and Reproducibility. The Compression Model Benchmark, Viral Genome Analysis and Classification sections have auxiliary material to their corresponding sections of the main article. On the other hand, the Software and Hardware recommendations section defines minimum requirements, and the Reproducibility section describes how to reproduce the results obtained in this article.

Compression Level Benchmark
Herein, it is depicted the supplementary material to the Compression Levels Benchmark of the methods section. Table S1 describes the parameters used in the six costume build levels. The flag "tm" is the template of a target context model, the flag "lr" defines the learning rate, and the flag "hs" defines the number of hidden nodes for the neural network. Table S1. Depiction of the parameters used in the six costume levels.

Viral Genome Analysis
We present the supplementary material discussed in the Viral Genome Analysis of this main article.  Table. The first represents the highest 10 NC values using standard settings NC (best performing model); the second group shows the top 10 lowest NC values obtained using the IR 2 subprogram. Finally, the third group shows the top 10 highest values of the difference between NC using IR 0 and IR 1 subprograms.
Tables S5,S6,and S7 organize the top taxa (by taxonomic group) regarding their normalized compression (NC), normalized compression capacity (NCC) and difference. The tables also shows the genomes' average Sequence Length and GC-Content.
Finally, Figure S1 depicts the phylogenetic tree with average NC difference (NC IR 0 -NC IR 1 > 0) for each viral taxonomic group up to the viral genus. The colour red depicting the highest NC difference, and the blue the lowest. subprograms.

Classification
Herein, we show the supplementary classification tables that are discussed in the classification subsection of this article. Figure S2 represents the number of samples (genome sequences) per viral genus. Table S8 and Table S9 show the values obtained using different classifiers for accuracy and F1-score, respectively. In both cases, the XGBoost classifier had the best performance.  Figure S1. Phylogenetic tree showing average difference (NC IR 0 -NC IR 1 > 0).The colour red depicts the branches where on average, the genome possesses more inverted repetitions than internal repetitions (higher difference), whereas the blue colour represents the branches with fewer inverted repetitions than internal repetitions (smaller difference).
Table S10 displays the XGBoost classifier F1-score results when using different sets of features. With the notable exception of the type of genome classification, the best results were obtained using all features.

Software and Hardware recommendations
The experiences of the manuscript can be replicated using a laptop, desktop, or server computer running Arch linux or Linux Ubuntu (for example, 18.04 LTS or higher) with GCC (https://gcc.gnu.org),git and git LFS, Conda (https://docs.conda.io) and python version 3.6. The hardware must contain at least 8 GB of RAM and a 100 GB disk.